Predicting Who Will Leave and Why - Python Machine Learning

In this notebook, we do a brief exploration of our HR analytics data (found on Kaggle, which you can check for more info on the dataset) and try to discern which factors matter the most in determining why our personnel leave. The notebook will primarily be divided into two sections -- data analysis and machine learning.

Data Analysis


In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Reading in the Data

First, let's read in and get an overview of the data we'll be working with.


In [2]:
hr_data = pd.read_csv('../input/HR_comma_sep.csv')
hr_data.head()


Out[2]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low

In [3]:
hr_data.describe()


Out[3]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610 0.238083 0.021268
std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719 0.425924 0.144281
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000
25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000 0.000000 0.000000
50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000
75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000

In [4]:
hr_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level       14999 non-null float64
last_evaluation          14999 non-null float64
number_project           14999 non-null int64
average_montly_hours     14999 non-null int64
time_spend_company       14999 non-null int64
Work_accident            14999 non-null int64
left                     14999 non-null int64
promotion_last_5years    14999 non-null int64
sales                    14999 non-null object
salary                   14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB

Conveniently, there is no missing data. Given that the "sales' and "salary" columns are non-numeric, we can check the number of unique levels and dummy code the variables.


In [5]:
print('Departments: ', ', '.join(hr_data['sales'].unique()))
print('Salary levels: ', ', '.join(hr_data['salary'].unique()))


Departments:  sales, accounting, hr, technical, support, management, IT, product_mng, marketing, RandD
Salary levels:  low, medium, high

In [6]:
hr_data.rename(columns={'sales':'department'}, inplace=True)
hr_data_new = pd.get_dummies(hr_data, ['department', 'salary'] ,drop_first = True)

In [7]:
hr_data_new.head()


Out[7]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years department_RandD department_accounting department_hr department_management department_marketing department_product_mng department_sales department_support department_technical salary_low salary_medium
0 0.38 0.53 2 157 3 0 1 0 0 0 0 0 0 0 1 0 0 1 0
1 0.80 0.86 5 262 6 0 1 0 0 0 0 0 0 0 1 0 0 0 1
2 0.11 0.88 7 272 4 0 1 0 0 0 0 0 0 0 1 0 0 0 1
3 0.72 0.87 5 223 5 0 1 0 0 0 0 0 0 0 1 0 0 1 0
4 0.37 0.52 2 159 3 0 1 0 0 0 0 0 0 0 1 0 0 1 0

Observe that "IT" and "high" are the baseline levels for the assigned department and salary level, respectively. Also note that we saved the data with dummified variables as another dataframe in case we need to access the string values, such as for a cross-tabulation table.

Exploring the Data

Now that we have data in an analysis-friendly form, we can do some basic visualizations to spot any relationships in the data.


In [8]:
# Correlation matrix
sns.heatmap(hr_data.corr(), annot=True)


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faa24efe550>

The matrix above shows that, generally speaking, the data is not correlated. This is good because it means we likely won't have issues with multicollinearity later.

It is notable, though perhaps unsurprising, that our employees' satisfaction level is the variable that is most highly correlated with them leaving.


In [9]:
hr_data_new.columns


Out[9]:
Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'department_RandD', 'department_accounting',
       'department_hr', 'department_management', 'department_marketing',
       'department_product_mng', 'department_sales', 'department_support',
       'department_technical', 'salary_low', 'salary_medium'],
      dtype='object')

Let's first check if there are any particular departments that our people tend to be leaving from.


In [10]:
dept_table = pd.crosstab(hr_data['department'], hr_data['left'])
dept_table.index.names = ['Department']
dept_table


Out[10]:
left 0 1
Department
IT 954 273
RandD 666 121
accounting 563 204
hr 524 215
management 539 91
marketing 655 203
product_mng 704 198
sales 3126 1014
support 1674 555
technical 2023 697

We can check the above in terms of percentages to more easily see if there are particular departments that tend to have a higher proportion of people leaving.


In [11]:
dept_table_percentages = dept_table.apply(lambda row: (row/row.sum())*100, axis = 1)
dept_table_percentages


Out[11]:
left 0 1
Department
IT 77.750611 22.249389
RandD 84.625159 15.374841
accounting 73.402868 26.597132
hr 70.906631 29.093369
management 85.555556 14.444444
marketing 76.340326 23.659674
product_mng 78.048780 21.951220
sales 75.507246 24.492754
support 75.100942 24.899058
technical 74.375000 25.625000

R&D and management tend to have lower rates of leaving, and HR and accounting tend to have higher rates of leaving. The other departments are fairly similar, all between around 22 to 25 percent. We can also visualize the above data with a countplot.


In [12]:
sns.countplot(x='department', hue='left', data=hr_data)


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faa1b4e0240>

In [13]:
sns.boxplot(x='department', y='satisfaction_level', data=hr_data)


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faa1b3cb5f8>

While there doesn't appear to be too much of a difference in the satisfaction, we notice that both HR and accounting, the departments that have the highest rates of leaving, have slightly lower median satisfaction levels than the rest of the departments.

Salary is likely to have a high impact on leaving. In fact, it is highly likely that both R&D and management, the two departments with the lowers leaving rates, have high salaries. Let's first check the relationship between leaving and salary.


In [14]:
sns.countplot(x='salary', hue='left', data=hr_data)


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faa1b21db00>

Confirming our hypothesis, those with low salaries tend to have the highest number of people that leave. Eyeballing the plot shows us that around 40% of those with low salaries leave and 25% of those with median salaries leave. It looks like only 10% of those with high salaries leave.

Let's also check the spread of satisfaction level between the different salary ranges.


In [15]:
sns.boxplot(x='salary', y='satisfaction_level', data=hr_data)


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faa1b1cd470>

Again, in line with some of our prior observations, low salary has the lowest median satisfaction and the highest spread.

Something that may impact employee perception in the company is the number of projects they are assigned.


In [16]:
sns.factorplot(x='number_project', y='last_evaluation', hue='department', data=hr_data)


Out[16]:
<seaborn.axisgrid.FacetGrid at 0x7faa1b12c208>

It is very clear that evaluation scores are affected by the number of projects assigned to the employee. What's more, we again notice a peculiar trend in accounting -- they have a lower last_evaluation score than the other departments at 7 projects.


In [17]:
sns.boxplot(x='number_project', y='satisfaction_level', data=hr_data_new)


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faa1ae6ab70>

It looks like we've found a very important relationship -- those with high numbers of projects (6 or 7) tend to have extremely low satisfaction levels. This will likely play a role when we do our modeling. Also worth noting is that those with only 2 projects tend to also have lower satisfaction levels.

Let's take a look at time spent at the company and the effect of that on leaving. It was the third most correlated factor with leaving, so this should give us some usable information. We also check this in the context of two of the variables we previously studied, salary and department, to see if there are additional insights we can extract.


In [18]:
timeplot = sns.factorplot(x='time_spend_company', hue='left', y='department', row='salary', data=hr_data, aspect=2)


There is a clear trend for those with low and medium salaries -- those that leave tend to have spent more time at the company. For those with high salaries, leaving depends on the department. At the high salary level, time spent doesn't vary in accounting for those that left versus those that haven't but it varies pretty wildly for the support and IT departments.

Before we move on to the modeling section, let's take a look at accidents. This was the second most correlated factor with leaving, interestingly enough.


In [19]:
accidentplot = plt.figure(figsize=(10,6))
accidentplotax = accidentplot.add_axes([0,0,1,1])
accidentplotax = sns.violinplot(x='department', y='average_montly_hours', hue='Work_accident', split=True, data = hr_data, jitter = 0.47)


The difference is quite subtle, but the monthly hours (just noticed when I made this plot that the variable was spelled wrong in the dataset) seems to be bimodally distributed more often for those without work accidents versus those with.

Let's check a similar plot to see the relationship between leaving, work accidents, and satisfaction level.Let's check a similar plot to see the relationship between leaving, work accidents, and satisfaction level.


In [20]:
satisaccident = plt.figure(figsize=(10,6))
satisaccidentax = satisaccident.add_axes([0,0,1,1])
satisaccidentax = sns.violinplot(x='left', hue='Work_accident', y='satisfaction_level', split=True, data=hr_data)


What we see here is that there is a marked difference in the satisfaction level spreads of those that leave versus those that don't, with the peaks for those that left being slightly more pronounced for those that have not had workplace accidents, interestingly enough.

Machine Learning

Modeling

Let's model the data with a decision tree.


In [21]:
# We now use model_selection instead of cross_validation
from sklearn.model_selection import train_test_split

X = hr_data_new.drop('left', axis=1)
y = hr_data_new['left']

X_train, X_test, y_train, y_test, = train_test_split(X, y, test_size = 0.3, random_state = 47)

In [22]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)


Out[22]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

While we will, of course, make predictions on our test set, we treat that as a holdout set and first do some cross-validation on our training set.


In [23]:
from sklearn.model_selection import cross_val_score

# Score first on our training data
print('Score: ', dt.score(X_train, y_train))
print('Cross validation score, 10-fold cv: \n', cross_val_score(dt, X_train, y_train, cv=10))
print('Mean cross validation score: ', cross_val_score(dt,X_train,y_train,cv=10).mean())


Score:  1.0
Cross validation score, 10-fold cv: 
 [ 0.97240723  0.97906755  0.97050428  0.98190476  0.97047619  0.97047619
  0.97330791  0.97807436  0.97044805  0.97140133]
Mean cross validation score:  0.974284156008

Our results are very good; showing a consistently high score for all folds of our 10-fold cross-validation using the training data. Let's make predictions and check the performance of the model on the holdout set in the same manner.


In [24]:
predictions = dt.predict(X_test)

print('Score: ', dt.score(X_test, y_test))
print('Cross validation score, 10-fold cv: \n', cross_val_score(dt, X, y, cv=10))
print('Mean cross validation score: ', cross_val_score(dt,X,y,cv=10).mean())


Score:  0.978
Cross validation score, 10-fold cv: 
 [ 0.98267821  0.98733333  0.968       0.96466667  0.962       0.98066667
  0.99        0.99266667  1.          1.        ]
Mean cross validation score:  0.981801154786

Once again, the model performs very well. Let's check on some additional classification metrics to see, in more detail, how our model does.

Evaluation


In [25]:
from sklearn.metrics import confusion_matrix, classification_report

print('Confusion matrix: \n', confusion_matrix(y_test, predictions), '\n')
print('Classification report: \n', classification_report(y_test, predictions))


Confusion matrix: 
 [[3376   66]
 [  33 1025]] 

Classification report: 
              precision    recall  f1-score   support

          0       0.99      0.98      0.99      3442
          1       0.94      0.97      0.95      1058

avg / total       0.98      0.98      0.98      4500

On the basis of our 4500 test samples, our model is very accurate, with only 98 test cases wrong (or only around 2.2% wrong). All of the other metrics -- precision, recall, f1-score -- are also very good. Our model also doesn't appear to display any inherent bias in predicting one class.

We can also take a look at the ROC curve to determine the effectiveness of the test at correctly classifying those who stay and those who leave.


In [26]:
from sklearn.metrics import roc_curve, roc_auc_score
probabilities = dt.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(y_test, probabilities[:,1])

rates = pd.DataFrame({'False Positive Rate': fpr, 'True Positive Rate': tpr})

roc = plt.figure(figsize = (10,6))
rocax = roc.add_axes([0,0,1,1])
rocax.plot(fpr, tpr, color='g', label='Decision Tree')
rocax.plot([0,1],[0,1], color='gray', ls='--', label='Baseline (Random Guessing)')
rocax.set_xlabel('False Positive Rate')
rocax.set_ylabel('True Positive Rate')
rocax.set_title('ROC Curve')
rocax.legend()

print('Area Under the Curve:', roc_auc_score(y_test, probabilities[:,1]))


Area Under the Curve: 0.974817087705

With a very high area under the curve of 0.977, our model is excellent at discriminating between those who stay and those who leave.

Let's check out the most important features, or those that are most influential in determining whether an employee leaves (or stays) in our company.


In [27]:
importances = dt.feature_importances_
print("Feature importances: \n")
for f in range(len(X.columns)):
    print('•', X.columns[f], ":", importances[f])


Feature importances: 

• satisfaction_level : 0.497565897987
• last_evaluation : 0.1461780216
• number_project : 0.105077754324
• average_montly_hours : 0.0921810351156
• time_spend_company : 0.13973536412
• Work_accident : 0.00143842147562
• promotion_last_5years : 0.000492724762216
• department_RandD : 0.000120529632333
• department_accounting : 0.000897045698658
• department_hr : 0.000713244271485
• department_management : 0.000610885472822
• department_marketing : 0.0011463137373
• department_product_mng : 0.000515881952772
• department_sales : 0.00172809387009
• department_support : 0.00140616352533
• department_technical : 0.00179546752072
• salary_low : 0.00652253448209
• salary_medium : 0.00187462045182

To make it easier to interpret, we can order these from most important to least.


In [28]:
featureswithimportances = list(zip(X.columns, importances))
featureswithimportances.sort(key = lambda f: f[1], reverse=True)

print('Ordered feature importances: \n', '(From most important to least important)\n')

for f in range(len(featureswithimportances)):
    print(f+1,". ", featureswithimportances[f][0], ": ", featureswithimportances[f][1])


Ordered feature importances: 
 (From most important to least important)

1 .  satisfaction_level :  0.497565897987
2 .  last_evaluation :  0.1461780216
3 .  time_spend_company :  0.13973536412
4 .  number_project :  0.105077754324
5 .  average_montly_hours :  0.0921810351156
6 .  salary_low :  0.00652253448209
7 .  salary_medium :  0.00187462045182
8 .  department_technical :  0.00179546752072
9 .  department_sales :  0.00172809387009
10 .  Work_accident :  0.00143842147562
11 .  department_support :  0.00140616352533
12 .  department_marketing :  0.0011463137373
13 .  department_accounting :  0.000897045698658
14 .  department_hr :  0.000713244271485
15 .  department_management :  0.000610885472822
16 .  department_product_mng :  0.000515881952772
17 .  promotion_last_5years :  0.000492724762216
18 .  department_RandD :  0.000120529632333

In [60]:
sorted_features, sorted_importances = zip(*featureswithimportances)
plt.figure(figsize=(12,6))
sns.barplot(sorted_features, sorted_importances)
plt.title('Feature Importances (Gini Importance)')
plt.ylabel('Decrease in Node Impurity')
plt.xlabel('Feature')
plt.xticks(rotation=90);


Most of the variables we studied a while ago, including satisfaction_level, last_evaluation, time_spend_company, number_project, and average_montly_hours, appear to be important. By studying the relationships of these variables between those who have left and those who haven't, we can more accurately determine who's leaving and why.

Interestingly, both salary and department appeared to have a relatively small effect in our decision tree model. This could be caused by the fact that the preceding 5 factors already more accurately describe the conditions of the person who will leave, regardless of department, or the fact that the preceding 5 factors are already strongly correlated enough with salary and/or department to reduce their importance in our final model.

Conclusion

It's definitely worth taking our employees' satisfaction levels more seriously. We've discovered that this is related to, among other things, their salary and the number of projects they have. Further study could lead to finding an optimal combination of salary, number of projects, and other important factors in taking care of our people that could lead to better performance and profits for us and a lower employee mortality rate. It's also worth noting that time spent at the company and employee evaluations also have an important effect on whether employees leave or not -- this could ultimately be connected to their work, so it's worth investigating in more detail how departments handle project delegation to their employees and what kinds of projects they're given, especially given that those from HR and accounting tend to have higher leave rates than the other functions.